Diabetes is a disease that occurs when your blood glucose, also called blood sugar, is too high.
Context: This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Content: The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Acknowledgements: Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.
Dataset taken from https://www.kaggle.com/uciml/pima-indians-diabetes-database
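The rename cell below maps column names like `'6'` and `'148'` back to meaningful names, which suggests the CSV was read with its first data record consumed as the header row (note `df.info()` later reports 767 rows, while the dataset has 768). A minimal loading sketch that avoids losing that record, assuming the Kaggle file is named `diabetes.csv` (a tiny in-memory CSV stands in for the real file here):

```python
import io
import pandas as pd

names = ['pregnancies', 'glucose', 'bloodPressure', 'skinThickness',
         'insulin', 'bmi', 'diabetesPedigreeFunction', 'age', 'outcome']

# Two sample records standing in for the headerless Kaggle file;
# for the real file: pd.read_csv('diabetes.csv', header=None, names=names)
raw = io.StringIO("6,148,72,35,0,33.6,0.627,50,1\n"
                  "1,85,66,29,0,26.6,0.351,31,0\n")
df_n = pd.read_csv(raw, header=None, names=names)
print(df_n.shape)  # → (2, 9): no record is lost to the header
```

With `header=None` and explicit `names`, the later `rename` step becomes unnecessary.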
df = df_n.rename({'6': 'pregnancies', '148': 'glucose', '72': 'bloodPressure',
                  '35': 'skinThickness', '0': 'insulin', '33.6': 'bmi',
                  '0.627': 'diabetesPedigreeFunction', '50': 'age',
                  '1': 'outcome'}, axis=1)
df.head(5)
| | pregnancies | glucose | bloodPressure | skinThickness | insulin | bmi | diabetesPedigreeFunction | age | outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
# get info about data: column names and number of rows
df.info()  # info() prints its summary directly and returns None, so display() is not needed
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 767 entries, 0 to 766
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   pregnancies               767 non-null    int64
 1   glucose                   767 non-null    int64
 2   bloodPressure             767 non-null    int64
 3   skinThickness             767 non-null    int64
 4   insulin                   767 non-null    int64
 5   bmi                       767 non-null    float64
 6   diabetesPedigreeFunction  767 non-null    float64
 7   age                       767 non-null    int64
 8   outcome                   767 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
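Note that several columns use 0 as a placeholder for missing measurements (visible in the `insulin` and `skinThickness` values in the head above; a blood pressure or BMI of zero is not physiological). A common cleanup step for this dataset, not applied in this notebook, is to treat those zeros as missing and impute (a toy frame stands in for `df`):

```python
import numpy as np
import pandas as pd

zero_as_missing = ['glucose', 'bloodPressure', 'skinThickness', 'insulin', 'bmi']

# toy frame standing in for df
df = pd.DataFrame({'glucose': [148, 85], 'bloodPressure': [72, 0],
                   'skinThickness': [35, 29], 'insulin': [0, 94],
                   'bmi': [33.6, 26.6]})

# replace placeholder zeros with NaN, then impute with the column median
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())
print(df['insulin'].tolist())  # → [94.0, 94.0]
```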
# data in figures
import matplotlib.pyplot as plt

df.hist(figsize=(12, 8), bins=50)
plt.tight_layout()
import plotly.graph_objects as go

# split by outcome for the box plots
pos = df[df["outcome"] == 1]
neg = df[df["outcome"] == 0]

fig = go.Figure()
fig.add_trace(go.Box(y=pos["age"], name="pos", marker_color="blue", boxpoints="all", whiskerwidth=0.3))
fig.add_trace(go.Box(y=neg["age"], name="neg", marker_color="#e75480", boxpoints="all", whiskerwidth=0.3))
fig.update_layout(template="seaborn", title="Outcome Distribution by age", height=600)
fig.show()
import seaborn as sns

plt.figure(figsize=(15, 6))
features = ['age', 'pregnancies', 'insulin', 'bmi', 'glucose', 'skinThickness',
            'bloodPressure', 'diabetesPedigreeFunction', 'outcome']
corr = df[features].corr()
sns.heatmap(corr, square=True, annot=True, linewidths=0.5, vmax=0.2)
C=abs(corr["outcome"]).sort_values(ascending=False)[1:]
print(C)
glucose                     0.465856
bmi                         0.292695
age                         0.236417
pregnancies                 0.221087
diabetesPedigreeFunction    0.173245
insulin                     0.131984
skinThickness               0.073265
bloodPressure               0.064882
Name: outcome, dtype: float64
The attributes most strongly correlated with a positive outcome are glucose, bmi, and age. We can now look in more detail at the relationships between individual variables, and between the outcome and each attribute.
c = sns.jointplot(data=df, x='age', y='glucose', kind='reg')
c.fig.suptitle("Correlation between age and glucose")
pg = sns.jointplot(data=df, x='pregnancies', y='glucose', kind='reg')
pg.fig.suptitle("Correlation between pregnancies and glucose")
tr = sns.jointplot(data=df, x='bmi', y='glucose', kind='reg')
tr.fig.suptitle("Correlation between bmi and glucose")
th = sns.jointplot(data=df, x='diabetesPedigreeFunction', y='glucose', kind='reg')
th.fig.suptitle("Correlation between diabetesPedigreeFunction and glucose")
ax = sns.violinplot(x="outcome", y="glucose", data=df, inner=None)
ax = sns.swarmplot(x="outcome", y="glucose", data=df,
color="white", edgecolor="gray")
ax = sns.violinplot(x="outcome", y="bmi", data=df, inner=None)
ax = sns.stripplot(x="outcome", y="bmi", data=df,
color="white", edgecolor="gray")
ax = sns.violinplot(x="outcome", y="diabetesPedigreeFunction", data=df, inner=None)
ax = sns.stripplot(x="outcome", y="diabetesPedigreeFunction", data=df,
color="white", edgecolor="gray")
sns.barplot(x="outcome", y="pregnancies", data=df)
ax = sns.violinplot(x="outcome", y="diabetesPedigreeFunction", data=df, inner=None)
ax = sns.swarmplot(x="outcome", y="diabetesPedigreeFunction", data=df,
color="white", edgecolor="gray")
Since this is a binary (0/1) classification problem, Logistic Regression is a natural baseline.
Now we split the data into training and testing sets (accounting for class imbalance via oversampling), normalize it, and apply the machine-learning algorithms (nine are compared below).
The aim is to find the most accurate model.
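The split-and-normalize step the models below rely on is not shown. A minimal sketch, using synthetic data as a stand-in; the exact test size, random state, and oversampling strategy used in the notebook are assumptions (the printed metrics suggest a test set of roughly 200 rows, i.e. about 26% of the data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))      # stands in for the 8 predictor columns
Y = rng.integers(0, 2, size=100)   # stands in for the outcome column

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.26, random_state=10, stratify=Y)

# fit the scaler on training data only, then apply to both splits
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
print(X_train.shape, X_test.shape)  # → (74, 8) (26, 8)
```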
start = time.time()
model_Log= LogisticRegression(random_state=10)
model_Log.fit(X_train,Y_train)
Y_pred= model_Log.predict(X_test)
end=time.time()
model_Log_time=end-start
model_Log_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of Logistic Regression model: {round((model_Log_time),5)} seconds\n")
#Plot and compute metrics
compute(Y_pred,Y_test)
Execution time of Logistic Regression model: 0.0286 seconds

Precision: 0.653
Recall: 0.932
F1-Score: 0.768
Accuracy: 71.0 %
Mean Square Error: 0.29
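The `compute()` helper called after each model is not defined in the notebook. A sketch consistent with the metrics printed above; the exact rounding and wording are assumptions:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error)

def compute(Y_pred, Y_test):
    # print the same metrics the notebook reports after each model
    print(f"Precision: {round(precision_score(Y_test, Y_pred), 3)}")
    print(f"Recall: {round(recall_score(Y_test, Y_pred), 3)}")
    print(f"F1-Score: {round(f1_score(Y_test, Y_pred), 3)}")
    print(f"Accuracy: {round(accuracy_score(Y_test, Y_pred), 4) * 100} %")
    print(f"Mean Square Error: {round(mean_squared_error(Y_test, Y_pred), 3)}")

# tiny usage example: one false positive out of four predictions
compute([1, 0, 1, 1], [1, 0, 0, 1])
```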
start=time.time()
model_KNN = KNeighborsClassifier(n_neighbors=15)
model_KNN.fit(X_train,Y_train)
Y_pred = model_KNN.predict(X_test)
end=time.time()
model_KNN_time = end-start
model_KNN_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of KNN model: {round((model_KNN_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of KNN model: 0.01687 seconds
Precision: 0.669
Recall: 0.883
F1-Score: 0.762
Accuracy: 71.5 %
Mean Square Error: 0.285
start=time.time()
model_RF = RandomForestClassifier(n_estimators=300,criterion="gini",random_state=5,max_depth=100)
model_RF.fit(X_train,Y_train)
Y_pred=model_RF.predict(X_test)
end=time.time()
model_RF_time=end-start
model_RF_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of RandomForestClassifier: {round((model_RF_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of RandomForestClassifier: 1.06807 seconds
Precision: 0.704
Recall: 0.854
F1-Score: 0.772
Accuracy: 74.0 %
Mean Square Error: 0.26
start=time.time()
model_tree=DecisionTreeClassifier(random_state=10,criterion="gini",max_depth=100)
model_tree.fit(X_train,Y_train)
Y_pred=model_tree.predict(X_test)
end=time.time()
model_tree_time=end-start
model_tree_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of DecisionTreeClassifier: {round((model_tree_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of DecisionTreeClassifier: 0.01006 seconds
Precision: 0.615
Recall: 0.728
F1-Score: 0.667
Accuracy: 62.5 %
Mean Square Error: 0.375
start=time.time()
model_svm=SVC(kernel="rbf")
model_svm.fit(X_train,Y_train)
Y_pred=model_svm.predict(X_test)
end=time.time()
model_svm_time=end-start
model_svm_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of SVC model: {round((model_svm_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of SVC model: 0.09016 seconds
Precision: 0.651
Recall: 0.922
F1-Score: 0.763
Accuracy: 70.5 %
Mean Square Error: 0.295
start=time.time()
model_ADA=AdaBoostClassifier(learning_rate= 0.15,n_estimators= 25)
model_ADA.fit(X_train,Y_train)
Y_pred= model_ADA.predict(X_test)
end=time.time()
model_ADA_time=end-start
model_ADA_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of AdaBoostClassifier: {round((model_ADA_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of AdaBoostClassifier: 0.1033 seconds
Precision: 0.677
Recall: 0.835
F1-Score: 0.748
Accuracy: 71.0 %
Mean Square Error: 0.29
start=time.time()
model_GB= GradientBoostingClassifier(random_state=10,n_estimators=20,learning_rate=0.29,loss="deviance")
model_GB.fit(X_train,Y_train)
Y_pred= model_GB.predict(X_test)
end=time.time()
model_GB_time=end-start
model_GB_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of GradientBoostingClassifier: {round((model_GB_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of GradientBoostingClassifier: 0.06016 seconds
Precision: 0.662
Recall: 0.854
F1-Score: 0.746
Accuracy: 70.0 %
Mean Square Error: 0.3
from xgboost import XGBClassifier
start=time.time()
model_xgb = XGBClassifier(objective='binary:logistic', learning_rate=0.1, max_depth=1,
                          n_estimators=50, colsample_bytree=0.5,
                          use_label_encoder=False, eval_metric='logloss')  # 'logloss' for binary classification (not 'mlogloss')
model_xgb.fit(X_train,Y_train)
Y_pred = model_xgb.predict(X_test)
end=time.time()
model_xgb_time=end-start
model_xgb_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of model: {round((model_xgb_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of model: 2.02463 seconds
Precision: 0.669
Recall: 0.845
F1-Score: 0.747
Accuracy: 70.5 %
Mean Square Error: 0.295
start=time.time()
model_gnb = GaussianNB()
model_gnb.fit(X_train,Y_train)
Y_pred = model_gnb.predict(X_test)  # the original cell mistakenly called model_xgb.predict, which is why the metrics below duplicate the XGBoost results
end=time.time()
model_gnb_time=end-start
model_gnb_accuracy=round(accuracy_score(Y_test,Y_pred), 4)*100 # Accuracy
print(f"Execution time of model: {round((model_gnb_time),5)} seconds")
#Plot and compute metric
compute(Y_pred,Y_test)
Execution time of model: 0.04227 seconds
Precision: 0.669
Recall: 0.845
F1-Score: 0.747
Accuracy: 70.5 %
Mean Square Error: 0.295
#Plot accuracies
accuracies={"Logistic regression": model_Log_accuracy,
"KNN": model_KNN_accuracy,
"SVM": model_svm_accuracy,
"Decision Tree": model_tree_accuracy,
"Random Forest": model_RF_accuracy,
"Ada Boost": model_ADA_accuracy,
"Gradient Boosting": model_GB_accuracy,
"XG Boost": model_xgb_accuracy,
"Naive Bayes": model_gnb_accuracy}
acc_list=accuracies.items()
k,v = zip(*acc_list)
temp=pd.DataFrame(index=k,data=v,columns=["Accuracy"])
temp.sort_values(by=["Accuracy"],ascending=False,inplace=True)
print(temp)
                     Accuracy
Random Forest            74.0
KNN                      71.5
Logistic regression      71.0
Ada Boost                71.0
SVM                      70.5
XG Boost                 70.5
Naive Bayes              70.5
Gradient Boosting        70.0
Decision Tree            62.5
exe_time={"Logistic regression": model_Log_time,
"KNN": model_KNN_time,
"SVM": model_svm_time,
"Decision Tree": model_tree_time,
"Random Forest": model_RF_time,
"Ada Boost": model_ADA_time,
"Gradient Boosting": model_GB_time,
"XG Boost": model_xgb_time,
"Naive Bayes": model_gnb_time}
time_list=exe_time.items()
k,v = zip(*time_list)
temp1=pd.DataFrame(index=k,data=v,columns=["Time"])
temp1.sort_values(by=["Time"],ascending=True,inplace=True)
print(temp1)
                         Time
Decision Tree        0.010059
KNN                  0.016870
Logistic regression  0.028605
Naive Bayes          0.042267
Gradient Boosting    0.060158
SVM                  0.090162
Ada Boost            0.103304
Random Forest        1.068065
XG Boost             2.024626
Random Forest gives the highest accuracy (74.0 %) but is among the slowest models; KNN offers the best trade-off between accuracy (71.5 %) and execution time.
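The cell that produced the KNN re-run output below is missing from the notebook; a sketch consistent with the earlier KNN cell, using synthetic stand-in data (the actual split and features are the notebook's own):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
X_train, X_test = rng.normal(size=(150, 8)), rng.normal(size=(50, 8))
Y_train, Y_test = rng.integers(0, 2, 150), rng.integers(0, 2, 50)

# same configuration as the earlier KNN cell
model_KNN = KNeighborsClassifier(n_neighbors=15)
model_KNN.fit(X_train, Y_train)
Y_pred = model_KNN.predict(X_test)
acc = accuracy_score(Y_test, Y_pred)
print(f"Accuracy: {round(acc, 4) * 100} %")
```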
Execution time of KNN model: 0.00996 seconds
Precision: 0.669
Recall: 0.903
F1-Score: 0.769
Accuracy: 72.0 %
Mean Square Error: 0.28
Ideally, the F1-score should be close to 1 and the mean squared error close to 0, so there is clearly plenty of room for improvement; researchers have tuned models for this problem using better data and more attributes. It illustrates that not all machine-learning algorithms give very accurate results out of the box.